Extracting and Annotating Wikipedia Sub-Domains
Abstract
We suggest a simple procedure for the extraction of Wikipedia sub-domains, propose a plain-text (human and machine readable) corpus exchange format, reflect on the interactions of Wikipedia markup and linguistic analysis, and report initial experimental results in parsing and treebanking a domain-specific sub-set of Wikipedia content.

[∗] This report on work in progress owes a lot to prior investigation by Woodley Packard, who started parsing Wikipedia using the ERG as early as 2003. We are furthermore indebted to Francis Bond, Yusuke Miyao, and Jan Tore Lønning for their encouragement and productive comments. The WeScience initiative is funded by the University of Oslo, as part of its research partnership with Stanford's Center for the Study of Language and Information.

1 Motivation and a Long-Term Vision

Linguistically annotated corpora—for English specifically the Penn Treebank (PTB; Marcus, Santorini, & Marcinkiewicz, 1993) and derivatives—have greatly advanced research on syntactic and semantic analysis. However, it has been observed repeatedly that (statistical) parsers trained on the PTB can drop sharply in parsing accuracy when applied to other data sets (Gildea, 2001, inter alios). Ever since its release, there have been concerns about the somewhat idiosyncratic nature of the PTB corpus (primarily Wall Street Journal articles from the late 1980s), in terms of its subject matter, genre, and (by now) age. Furthermore—given the cost of the initial construction of the PTB, and its still dominant role in data-driven natural language processing (NLP) for English—design decisions made two decades ago perpetuate (sometimes in undesirable ways) into contemporary annotation work. PropBank (Palmer, Gildea, & Kingsbury, 2005), for example, performs semantic annotation on the basis of PTB syntactic structures, such that a discontinuous structure like "Gulf received a takeover bid from Simmons of $50 million." (simplified from WSJ0178) leads to an analysis of receive as a four-place relation (with a dubious ARG4-of role).

Quite generally speaking, richly annotated treebanks that exemplify a variety of domains and genres (and of course languages other than English) are not yet available, and neither are broadly accepted gold-standard representations that adequately support a range of distinct NLP tasks and techniques. In response to a growing interest in so-called eScience applications of NLP—computationally intensive, large-scale text processing to advance research and education—a lot of current research targets scholarly literature, often in molecular biology or chemistry (Tateisi, Yakushiji, Ohta, & Tsujii, 2005; Rupp, Copestake, Teufel, & Waldron, 2007; inter alios). Due to the specialized nature of these domains, however, many NLP research teams—without in-depth knowledge of the subject area—report difficulties in actually 'making sense' of their data. To make eScience more practical (and affordable for smaller teams), we propose a simple technique for compiling and annotating domain-specific corpora of scholarly literature, initially drawing predominantly on encyclopedic texts from the community resource Wikipedia.[1] Adapting the Redwoods grammar-based annotation approach (Oepen, Flickinger, Toutanova, & Manning, 2004) to this task, we expect to construct and distribute a new treebank of texts in our own field, Computational Linguistics, annotated with both syntactic and (propositional) semantic information—dubbed the WeScience Treebank.
Should this approach prove feasible and sufficiently cost-effective, we expect that it can be adapted to additional document collections and genres, ideally giving rise—over time—to a growing repository of 'community treebanks', as well as to greater flexibility in terms of gold-standard representations. In an initial experiment, we gauge the feasibility of our approach and briefly discuss the interaction of (display) markup and linguistic analysis (§ 2 and § 3); we further report on a very preliminary experiment in sentence segmentation, parsing, and annotation (§ 4), and conclude by projecting these results into an expected release date for a first version of WeScience.

[1] In 2007 and 2008, interest in Wikipedia content for NLP research increased markedly; see http://www.mkbergman.com/?p=417 for an overview of recent Wikipedia-based R&D, most of it from a Semantic Web point of view.

2 Wikipedia—Some Facts and Figures

Wikipedia represents probably the largest and most easily accessible body of text on the Internet. Under the terms of the open-source GNU Free Documentation License, Wikipedia content can be accessed, used, and redistributed freely. Wikipedia to date contains more than 1.74 billion words in 9.25 million articles, in approximately 250 languages. English represents by far the largest language resource, with more than 1 billion words distributed among 2,543,723 articles.[2] Its size, hypertext nature, and availability make Wikipedia an attractive target for NLP research.

[2] Given the highly dynamic nature of Wikipedia, these statistics evolve constantly. We report on the stable 'release' snapshot dated July 2008, which also provides the starting point for our sub-domain extraction. See http://en.wikipedia.org/wiki/Wikipedia for up-to-date statistics.

Wikipedia's editing process distinguishes it from most other documents that are available on-line. An article may have countless authors and editors. In April 2008, the English Wikipedia received 220,949 edits a day, with a total of 175,884 distinct editors that month. Wikipedia text provides relatively coherent, relatively high-quality language, but it inevitably also presents a comparatively high degree of linguistic (including stylistic) variation. It is thus indicative of dynamic, community-created content (see below).

2.1 Domain-Specific Selection of Text

Our goal in the WeScience Treebank is to extract a sub-domain corpus targeting our own research field—NLP. To approximate the notion of a specific sub-domain in Wikipedia (or potentially other hyper-linked electronic text), we start from the Wikipedia category system—an optional facility for associating articles with one or more labels drawn from a hierarchy of (user-supplied) categories. The category system, however, is immature and appears far less carefully maintained than the articles proper; by itself, it would therefore yield a relatively poor demarcation of a specific subject area. For our purposes, we chose the category Computational Linguistics and all its sub-categories—which include, among others, Natural Language Processing, Data Mining, and Machine Translation—to activate an initial seed of potentially relevant articles. Altogether, 355 articles are categorized under Computational Linguistics or one of its sub-categories. However, some of these articles seemed somewhat out-of-domain (see below for examples), and several are so-called stub articles, or very specific and short, e.g. articles about individual software tools or companies. It was also apparent that many relevant articles are not (yet) associated with any of these categories. To compensate for the limitations of the Wikipedia category system, we applied a simple link analysis and counted the number of cross-references to other Wikipedia articles from our initial seed set.
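To make this procedure concrete, the sketch below illustrates one possible reading of it: the seed set is collected by walking the category hierarchy breadth-first, and every article referenced from the seed set is then scored by how many seed articles link to it. This is only a minimal sketch; the category, membership, and link tables are assumed to have been extracted from a Wikipedia dump beforehand, the function and parameter names are hypothetical, and the thresholds applied in the final step are the ones discussed in the remainder of this section.

```python
from collections import Counter, deque
from typing import Dict, List, Set


def seed_articles(root: str,
                  subcats: Dict[str, List[str]],
                  members: Dict[str, List[str]]) -> Set[str]:
    """Collect all articles filed under `root` or any of its sub-categories.

    `subcats` maps a category to its sub-categories and `members` maps a
    category to the articles it contains; both are assumed to have been
    extracted from a Wikipedia dump in advance.
    """
    seen, queue, articles = {root}, deque([root]), set()
    while queue:
        category = queue.popleft()
        articles.update(members.get(category, []))
        for sub in subcats.get(category, []):
            if sub not in seen:
                seen.add(sub)
                queue.append(sub)
    return articles


def count_crossreferences(seeds: Set[str],
                          links: Dict[str, List[str]]) -> Counter:
    """Count, for every article, how many distinct seed articles link to it.

    `links` maps an article title to the titles it references (its
    outgoing wiki links).
    """
    counts: Counter = Counter()
    for seed in seeds:
        for target in set(links.get(seed, [])):
            counts[target] += 1
    return counts


def select_articles(counts: Counter,
                    lengths: Dict[str, int],
                    min_refs: int = 7,
                    min_chars: int = 2000) -> List[str]:
    """Keep articles that clear the cross-reference and length thresholds
    discussed below (seven incoming references; 2,000 characters of
    article source, including markup)."""
    return [article for article, n in counts.most_common()
            if n >= min_refs and lengths.get(article, 0) >= min_chars]
```

Whether a cross-reference is counted once per linking seed article or once per link occurrence is a design choice the figures below do not disambiguate; the sketch takes the former reading.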
By filtering out articles with a comparatively low number of cross-references, we aim to quantify the significance of all candidate articles to our domain, expecting to improve both the recall and the precision of sub-domain extraction. Among the articles that were filtered out from our original set of seed articles, we find examples like AOLbyPhone (0 references) and Computational Humor (1 reference). Conversely, new articles, categorized differently, were activated by this approach; these include quite prominent examples like Machine learning (34 references), Artificial intelligence (33 references), and Linguistics (24 references). Of the 355 seed articles, only 30 remain in the final selection. Confirming our expectations, filtering based on link analysis eliminated the majority of very narrowly construed articles, e.g. those about specific tools and enterprises. However, the cross-reference metric also activates a few articles that are dubious in terms of the target sub-domain, for example United States (8 references). We have so far deliberately set up our sub-domain extraction approach as a fully automated procedure, avoiding any element of subjective judgment; however, we expect to further refine the results based on feedback from the scientific community. To suppress the linguistically less rewarding stub articles, we further applied a minimum length threshold of 2,000 characters (including markup) and—using a minimum of seven incoming cross-references—were left with 100 Wikipedia articles and approximately 270,000 tokens.

2.2 Spelling Conventions and Stylistic Variation

The English Wikipedia does not conform to one specific (national) spelling convention, but according to the guidelines for contributors and editors, the language within an article should be consistent with respect to spelling and grammar (thus, centre and center should not be used side by side). Articles are not tagged according to which variant of English they use, and we have yet to gather statistics on the distribution of English variants (one possible way of doing so is sketched at the end of this section).

In our view, the WeScience Treebank reflects the kind of content that is growing rapidly on the Internet, namely community-created content. In such content one will typically expect more variation in style (and quality), more spelling mistakes and 'imperfect' grammar, as well as some proportion of text produced by non-native speakers. The WeScience Treebank may thus provide a useful point of reference for studying the distribution of such phenomena in community-created content.
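As a minimal illustration of how statistics on the distribution of English variants (and on within-article consistency) might be gathered, the sketch below counts hits for a handful of British versus American spelling pairs in each article and labels the article accordingly. The word list and the naive tokenization are illustrative assumptions only, not part of the WeScience setup.

```python
import re
from collections import Counter
from typing import Dict, Tuple

# A few illustrative (British, American) spelling pairs; a real survey
# would need a much larger, curated list.
VARIANT_PAIRS = [("centre", "center"), ("colour", "color"),
                 ("analyse", "analyze"), ("labelled", "labeled")]


def variant_profile(text: str) -> Tuple[int, int]:
    """Return (#British hits, #American hits) for one article text."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    british = sum(tokens[b] for b, _ in VARIANT_PAIRS)
    american = sum(tokens[a] for _, a in VARIANT_PAIRS)
    return british, american


def survey(articles: Dict[str, str]) -> Dict[str, str]:
    """Label each article 'British', 'American', 'mixed', or 'unknown',
    flagging articles that mix variants (e.g. centre alongside center)."""
    labels = {}
    for title, text in articles.items():
        british, american = variant_profile(text)
        if british and american:
            labels[title] = "mixed"
        elif british or american:
            labels[title] = "British" if british else "American"
        else:
            labels[title] = "unknown"
    return labels
```

A per-article consistency check of this kind would directly reflect the editorial guideline cited above, while the aggregate label counts would give the variant distribution we have yet to measure.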